Exploring the Combinatorics of Motif Alignments for Accurately Computing E-values from P-values

Author

  • T. Eftestøl
Abstract

In biological and biomedical research, motif finding tools are important for locating regulatory elements in DNA sequences. Many such tools are available, and they often yield position weight matrices and significance indicators. These indicators, p-values and E-values, describe the likelihood that a motif alignment is generated by the background process and the expected number of occurrences of the motif in the data set, respectively. The various tools often estimate these indicators differently, so they are not directly comparable. One approach to comparing motifs from different tools is to compute the E-value as the product of the p-value and the number of possible alignments in the data set. In this paper we explore the combinatorics of the motif alignment models OOPS, ZOOPS, and ANR, and propose a generic algorithm for accurately computing the number of possible combinations. We also show that using the wrong alignment model can give E-values that diverge significantly from their true values.

Keywords: Motif alignment, combinatorics, p-value, E-value, OOPS, ZOOPS, ANR.
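The relation stated in the abstract, E-value = p-value × number of possible alignments, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's algorithm: the function names are hypothetical, the model names are read in the usual MEME sense (OOPS: one occurrence per sequence; ZOOPS: zero or one occurrence per sequence; ANR: any number of repetitions), sites within one sequence are assumed not to overlap, and the empty alignment is counted for ZOOPS and ANR.

```python
from math import comb, prod


def start_positions(length, width):
    """Number of start positions for one site of `width` in a sequence of `length`."""
    return max(length - width + 1, 0)


def count_oops(lengths, width):
    """Assumed OOPS count: exactly one site in every sequence."""
    return prod(start_positions(n, width) for n in lengths)


def count_zoops(lengths, width):
    """Assumed ZOOPS count: each sequence has no site or exactly one site.

    The alignment with no sites at all is included in this count.
    """
    return prod(start_positions(n, width) + 1 for n in lengths)


def count_anr(lengths, width):
    """Assumed ANR count: any number of non-overlapping sites per sequence.

    Placing k non-overlapping sites of `width` in a sequence of length n
    can be done in C(n - k*(width - 1), k) ways.
    """
    def per_sequence(n):
        return sum(comb(n - k * (width - 1), k) for k in range(n // width + 1))

    return prod(per_sequence(n) for n in lengths)


def e_value(p_value, n_alignments):
    """E-value as the product of the p-value and the number of possible alignments."""
    return p_value * n_alignments


if __name__ == "__main__":
    lengths, width, p = [10, 10], 3, 1e-3
    for name, counter in [("OOPS", count_oops), ("ZOOPS", count_zoops), ("ANR", count_anr)]:
        n = counter(lengths, width)
        print(name, n, e_value(p, n))
```

Even for this toy input of two sequences of length 10 and a motif of width 3, the three counts differ by more than an order of magnitude, so the same p-value maps to quite different E-values depending on which model is assumed.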


Similar resources

Application of Soft Computing Methods for the Estimation of Roadheader Performance from Schmidt Hammer Rebound Values

Estimation of roadheader performance is one of the main topics in determining the economics of underground excavation projects. Poor performance estimation of roadheaders can lead to costly contractual claims. In this paper, the application of soft computing methods for data analysis called adaptive neuro-fuzzy inference system-subtractive clustering method (ANFIS-SCM) and artificial neu...

Estimation of Reference Values of Biochemical Parameters Exploring the Renal Function in Adults in Ngaoundere, Cameroon

Clinical examinations are accompanied by biological analyses to guide or confirm the clinical diagnosis. The results of these analyses are interpreted by comparison with reference values. Studies on the biological norms of Africans are rare, if not nonexistent. The aim of this study was to establish population-specific reference values for biochemical indices serving as renal function biom...

Some remarks on the sum of the inverse values of the normalized signless Laplacian eigenvalues of graphs

Let $G=(V,E)$, $V=\{v_1,v_2,\ldots,v_n\}$, be a simple connected graph with $n$ vertices, $m$ edges and a sequence of vertex degrees $d_1\geq d_2\geq\cdots\geq d_n>0$, $d_i=d(v_i)$. Let $A=(a_{ij})_{n\times n}$ and $D=\mathrm{diag}(d_1,d_2,\ldots,d_n)$ be the adjacency and the diagonal degree matrix of $G$, respectively. Denote by $\mathcal{L}^+(G)=D^{-1/2}(D+A)D^{-1/2}$ the normalized signles...

On the signed Roman edge k-domination in graphs

Let $k\geq 1$ be an integer, and $G=(V,E)$ be a finite and simple graph. The closed neighborhood $N_G[e]$ of an edge $e$ in a graph $G$ is the set consisting of $e$ and all edges having a common end-vertex with $e$. A signed Roman edge $k$-dominating function (SREkDF) on a graph $G$ is a function $f:E\rightarrow\{-1,1,2\}$ satisfying the conditions that (i) for every edge $e$ of $G$, $\sum_{x\in N[e]} f...

Beyond the E-Value: Stratified Statistics for Protein Domain Prediction

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems, that is, those in which statistical tests can be partitioned natura...



Publication date: 2009